RLVR
When Is Compositional Reasoning Learnable from Verifiable Rewards?
Barzilai, Daniel, Wolf, Yotam, Basri, Ronen
The emergence of compositional reasoning in large language models through reinforcement learning with verifiable rewards (RLVR) has been a key driver of recent empirical successes. Despite this progress, it remains unclear which compositional problems are learnable in this setting using outcome-level feedback alone. In this work, we theoretically study the learnability of compositional problems in autoregressive models under RLVR training. We identify a quantity that we call the task-advantage ratio, a joint property of the compositional problem and the base model, that characterizes which tasks and compositions are learnable from outcome-level feedback. On the positive side, using this characterization, we show that compositional problems where correct intermediate steps provide a clear advantage are efficiently learnable with RLVR. We also analyze how such an advantage naturally arises in different problems. On the negative side, when the structural advantage is not present, RLVR may converge to suboptimal compositions. We prove that, in some cases, the quality of the base model determines if such an advantage exists and whether RLVR will converge to a suboptimal solution. We hope our analysis can provide a principled theoretical understanding of when and why RLVR succeeds and when it does not.
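The abstract leaves the task-advantage ratio informal. Purely as an illustrative guess at its flavor (explicitly not the paper's definition), one could compare the base model's success probabilities conditioned on correct versus incorrect intermediate steps:

```latex
% Illustrative only -- not the paper's definition. One natural way to quantify
% the "advantage" of correct intermediate steps under a base model \pi_0:
\[
\rho(\text{task}, \pi_0) \;=\;
\frac{\Pr_{\pi_0}\bigl[\,\text{final answer correct} \mid \text{correct intermediate step}\,\bigr]}
     {\Pr_{\pi_0}\bigl[\,\text{final answer correct} \mid \text{incorrect intermediate step}\,\bigr]}
\]
% When a ratio of this kind is large, outcome-level reward already favors
% trajectories passing through correct intermediate steps -- the regime the
% abstract describes as efficiently learnable with RLVR.
```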
Escaping the Verifier: Learning to Reason via Demonstrations
Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain under-utilized for reasoning-focused training. We introduce RARO (Relativistic Adversarial Reasoning Optimization), which learns strong reasoning capabilities from expert demonstrations alone via Inverse Reinforcement Learning. Our method sets up an adversarial game between a policy and a relativistic critic: the policy learns to mimic expert answers, while the critic aims to identify the expert answer within (expert, policy) answer pairs. Both the policy and the critic are trained jointly and continuously via RL, and we identify the key stabilization techniques required for robust learning. Empirically, RARO significantly outperforms strong verifier-free baselines on all of our evaluation tasks -- Countdown, DeepMath, and Poetry Writing -- and enjoys the same robust scaling trends as RL with verifiers. These results demonstrate that our method effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning even when task-specific verifiers are unavailable.

Recent advances in Large Language Models (LLMs) have been driven substantially by improvements in their reasoning abilities. Reasoning enables LLMs to perform deliberate intermediate computations before producing answers to user queries, proposing candidate solutions and self-corrections. Much of this progress has been enabled via Reinforcement Learning (RL) on verifiable tasks such as mathematics and competitive programming (DeepSeek-AI et al., 2025; Yang et al., 2025a; Shao et al., 2024; Luo et al., 2025). Notably, recent work has demonstrated that RL with Verifiable Rewards (RLVR) can enable LLMs to develop robust reasoning capabilities without any additional supervision (DeepSeek-AI et al., 2025). A growing body of work further improves the efficiency and stability of such RL algorithms on verifiable tasks, such as DAPO (Yu et al., 2025) and GSPO (Zheng et al., 2025). However, comparatively little attention has been paid to developing reasoning abilities on non-verifiable tasks, where task-specific verifiers are unavailable. Yet, in many impactful and challenging tasks -- such as analytical writing, open-ended research, or financial analysis -- LLM outputs are not directly verifiable due to hard-to-specify criteria, wide variation among acceptable answers, and other practical constraints. A popular approach in these settings is Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022; Rafailov et al., 2023), but it requires collecting human preferences beyond demonstration data, which is often a time-consuming and expensive process.
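The adversarial game above suggests a relativistic (pairwise) critic in the spirit of relativistic GANs. The paper's exact objectives are not reproduced here, so the following is a minimal PyTorch sketch under assumed losses: `relativistic_critic_loss` and `policy_reward` are illustrative names, and the critic is assumed to emit one scalar score per answer.

```python
import torch
import torch.nn.functional as F

def relativistic_critic_loss(s_expert: torch.Tensor, s_policy: torch.Tensor) -> torch.Tensor:
    """Critic objective: score the expert answer above the policy answer
    within each (expert, policy) pair. Inputs are critic scores, shape (B,)."""
    # Relativistic formulation: classify which member of the pair is the
    # expert based on the score *difference*, not absolute scores.
    return F.binary_cross_entropy_with_logits(
        s_expert - s_policy, torch.ones_like(s_expert)
    )

def policy_reward(s_expert: torch.Tensor, s_policy: torch.Tensor) -> torch.Tensor:
    """Policy reward: the probability the critic mistakes the policy answer
    for the expert one; usable as a scalar reward in the RL update."""
    return torch.sigmoid(s_policy - s_expert)
```

Scoring the difference within each pair, rather than each answer in isolation, keeps the critic's decision boundary calibrated as the policy improves, which is the usual motivation for relativistic formulations.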
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
SPARK: Stepwise Process-Aware Rewards for Reference-Free Reinforcement Learning
Rahman, Salman, Gorantla, Sruthi, Gupta, Arpit, Roy, Swastik, Peng, Nanyun, Liu, Yang
Process reward models (PRMs) that provide dense, step-level feedback have shown promise for reinforcement learning, yet their adoption remains limited by the need for expensive step-level annotations or ground-truth references. In the first stage, we collect multiple independent step-level verifications of candidate solutions; in the second stage, we use these verification outputs as synthetic training data to fine-tune generative process reward models, which subsequently serve as reward signals during training. We show that aggregating multiple independent verifications at the step level produces training data for process reward models that surpass ground-truth outcome supervision -- achieving 67.5 F1 on ProcessBench (a benchmark for identifying erroneous steps in mathematical reasoning), compared to 66.4 for reference-guided training and 61.9 for GPT-4o. In the final stage, we apply our generative PRM with chain-of-thought verification (PRM-CoT) as the reward model in RL experiments on mathematical reasoning, and introduce format constraints to prevent reward hacking. Our work enables reference-free RL training that exceeds ground-truth methods, opening new possibilities for domains lacking verifiable answers or accessible ground truth.

Large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, from achieving gold-medal performance at the International Mathematical Olympiad to autonomous agentic coding (Castelvecchi, 2025; Luong & Lockhart, 2025; Yang et al., 2024b; Hurst et al., 2024; Anthropic, 2025). Recent breakthroughs like OpenAI's o1 and DeepSeek's R1 demonstrate that reinforcement learning (RL) post-training can significantly enhance reasoning capabilities beyond supervised fine-tuning alone (Jaech et al., 2024; Guo et al., 2025), as RL enables models to explore diverse solution paths and learn from feedback rather than imitation (Chu et al., 2025). While RL post-training shows promise, current approaches rely on verifiers that require ground-truth references. Traditional methods rely on either discriminative verifiers that provide binary correctness signals (Cobbe et al., 2021) or rule-based verifiers using exact answer matching (RLVR) (Guo et al., 2025; Hu et al., 2025), both offering only sparse, outcome-level rewards. Recent advances introduce Process Reward Models (PRMs) that provide denser, step-level feedback to improve training stability and credit assignment (Lightman et al., 2023; Wang et al., 2024; Uesato et al., 2022), including co-evolving approaches like TANGO (Zha et al., 2025) and PRIME (Yuan et al., 2024) that jointly train the verifier alongside the policy model; PRIME, however, requires outcome-level correctness labels to train its PRM (Zha et al., 2025; Yuan et al., 2024).
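The step-level aggregation lends itself to a short sketch. Assuming each solution step receives K independent binary verdicts from an LLM verifier (the paper's exact consolidation rule may differ), a majority vote over verdicts produces the synthetic step labels:

```python
from collections import Counter

def aggregate_step_verdicts(verdicts_per_step: list[list[bool]]) -> list[bool]:
    """Majority vote over K independent verifier judgments per step.

    verdicts_per_step[i] holds the K binary verdicts for step i; the
    consolidated label marks a step correct when most verifiers agree.
    """
    labels = []
    for verdicts in verdicts_per_step:
        counts = Counter(verdicts)
        labels.append(counts[True] >= counts[False])
    return labels

# Example: 3 steps, each judged by K=5 independent verifications.
print(aggregate_step_verdicts([
    [True, True, True, False, True],     # step 1 -> correct
    [False, False, True, False, False],  # step 2 -> erroneous
    [True, True, False, True, True],     # step 3 -> correct
]))  # [True, False, True]
```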
- Europe > Austria > Vienna (0.14)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- North America > United States (0.04)
- Asia > Middle East > Jordan (0.04)
PretrainZero: Reinforcement Active Pretraining
Xing, Xingrun, Fan, Zhiyuan, Lou, Jie, Li, Guoqi, Zhang, Jiajun, Zhang, Debing
Mimicking human behavior to actively learn from general experience and achieve artificial general intelligence has long been a dream. Recent reinforcement learning (RL) based large-thinking models demonstrate impressive expert-level abilities, e.g., in software and mathematics, but still rely heavily on verifiable rewards in specific domains, creating a significant bottleneck for extending the performance boundary of general reasoning capabilities. In this work, we propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus that extends RL from domain-specific post-training to general pretraining. PretrainZero features the following characteristics: 1) Active pretraining: inspired by the active learning ability of humans, PretrainZero learns a unified reasoning policy to actively identify reasonable and informative content in the pretraining corpus, and reasons to predict that content via RL. 2) Self-supervised learning: without any verifiable labels, pretrained reward models, or supervised fine-tuning, we directly pretrain reasoners from 3B to 30B base models on the general Wikipedia corpus using RL, significantly breaking the verification data wall for general reasoning. 3) Verification scaling: by tackling increasingly challenging masked spans, PretrainZero substantially enhances the general reasoning abilities of pretrained base models. In reinforcement pretraining, PretrainZero improves Qwen3-4B-Base by 8.43, 5.96, and 10.60 points on MMLU-Pro, SuperGPQA, and math average benchmarks, respectively. In post-training, the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.
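As a minimal sketch of the self-supervised signal, assume exact-match recovery of a masked span scores 1 and anything else scores 0 (PretrainZero's actual span selection and scoring are learned and richer than this):

```python
import re

def masked_span_reward(original: str, span: tuple[int, int], prediction: str) -> float:
    """Self-supervised reward: 1.0 if the model's prediction recovers the
    masked span of the source text, else 0.0. No external verifier needed."""
    start, end = span
    target = original[start:end]
    norm = lambda s: re.sub(r"\s+", " ", s).strip().lower()
    return 1.0 if norm(prediction) == norm(target) else 0.0

text = "The Eiffel Tower was completed in 1889 for the Paris World's Fair."
print(masked_span_reward(text, (34, 38), "1889"))  # 1.0
```

Because the target is the corpus text itself, this kind of reward is available for any document, which is the sense in which such an approach sidesteps the verification data wall.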
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Virginia (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning
Alam, Md Tanvirul, Rastogi, Nidhi
Mathematical reasoning is a central challenge for large language models (LLMs), requiring not only correct answers but also faithful reasoning processes. Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach for enhancing such capabilities; however, its ability to foster genuine reasoning remains unclear. We investigate RLVR on two combinatorial problems with fully verifiable solutions: Activity Scheduling and the Longest Increasing Subsequence, using carefully curated datasets with unique optima. Across multiple reward designs, we find that RLVR improves evaluation metrics but often by reinforcing superficial heuristics rather than acquiring new reasoning strategies. These findings highlight the limits of RLVR generalization, emphasizing the importance of benchmarks that disentangle genuine mathematical reasoning from shortcut exploitation and provide faithful measures of progress. Code available at https://github.com/xashru/rlvr-seq-generalization.
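Both tasks admit cheap, exact verifiers, which is what makes them natural RLVR testbeds. As an illustrative sketch (not the authors' released code), a Longest Increasing Subsequence verifier can check a candidate against the dynamic-programming optimum:

```python
def lis_length(nums: list[int]) -> int:
    """O(n^2) dynamic program for the longest strictly increasing subsequence."""
    if not nums:
        return 0
    best = [1] * len(nums)
    for i in range(1, len(nums)):
        for j in range(i):
            if nums[j] < nums[i]:
                best[i] = max(best[i], best[j] + 1)
    return max(best)

def verify_lis(nums: list[int], candidate: list[int]) -> bool:
    """Reward-1 check: candidate must be a strictly increasing subsequence
    of nums whose length matches the DP optimum."""
    it = iter(nums)
    is_subseq = all(any(x == y for y in it) for x in candidate)
    is_increasing = all(a < b for a, b in zip(candidate, candidate[1:]))
    return is_subseq and is_increasing and len(candidate) == lis_length(nums)

print(verify_lis([10, 9, 2, 5, 3, 7, 101, 18], [2, 3, 7, 18]))  # True
```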
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.86)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs
Cho, Dongkyu Derek, Song, Huan, Chowdhury, Arijit Ghosh, An, Haotian, Wang, Yawei, Thekkanal, Rohit, Sokhandan, Negin, Keshava, Sharlina, Marlowe, Hannah
Fine-tuning aligned LLMs for new capabilities is known to erode their safety guardrails, and this degradation persists across standard approaches including supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). While reinforcement learning with verifiable rewards (RLVR) has emerged as a promising alternative that optimizes models on objectively measurable tasks, its safety implications remain unexplored. We present the first comprehensive theoretical and empirical analysis of safety properties in RLVR. Theoretically, we derive upper bounds on safety drift under KL-constrained optimization and prove conditions under which safety degradation is eliminated. Empirically, we conduct extensive experiments across five adversarial safety benchmarks, demonstrating that RLVR can simultaneously enhance reasoning capabilities while maintaining or improving safety guardrails. Our comprehensive ablation studies examine the effects of optimization algorithms, model scale, and task domains. Our findings challenge the prevailing assumption of an inevitable safety-capability trade-off, and establish that a specific training methodology can achieve both objectives simultaneously, providing insights for the safe deployment of reasoning-capable LLMs.
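The KL-constrained setting referenced here is the standard KL-regularized RLVR objective. In assumed notation (not necessarily the paper's), the source of a safety-drift bound can be sketched as follows:

```latex
% Standard KL-regularized RLVR objective (notation assumed, not the paper's):
% the policy \pi_\theta maximizes verifiable reward r while staying close to
% the safety-aligned reference policy \pi_{\mathrm{ref}}.
\[
\max_{\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
\Bigl[ r(x, y) \;-\; \beta \,
\mathbb{D}_{\mathrm{KL}}\!\bigl( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr) \Bigr]
\]
% Intuition for a drift bound: any behavioral shift on safety prompts is
% controlled by the KL term, so a sufficiently large \beta limits how far the
% tuned policy can move from the aligned reference.
```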
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.87)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.86)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Yue, Yang, Chen, Zhiqi, Lu, Rui, Zhao, Andrew, Wang, Zhaokai, Yue, Yang, Song, Shiji, Huang, Gao
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and programming tasks. Similar to how traditional RL helps agents explore and learn new strategies, RLVR is believed to enable LLMs to continuously self-improve, thus acquiring novel reasoning abilities beyond those of the corresponding base models. In this study we critically examine the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across various model families, RL algorithms, and math, coding, and visual reasoning benchmarks, using pass@k at large k values as the evaluation metric. Surprisingly, we find that the current training setup does not elicit fundamentally new reasoning patterns. While RLVR-trained models outperform their base models at small k (e.g., k = 1), the base models achieve a higher pass@k score when k is large. Coverage and perplexity analyses show that the observed reasoning abilities originate from and are bounded by the base model. Treating the base model as an upper bound, our quantitative analysis shows that six popular RLVR algorithms perform similarly and remain far from optimal in leveraging the potential of the base model. By contrast, we find that distillation can introduce new reasoning patterns from the teacher and genuinely expand the model's reasoning capabilities. Overall, our findings suggest that current RLVR methods have not yet realized the potential of RL to elicit truly novel reasoning abilities in LLMs. This highlights the need for improved RL paradigms, such as continual scaling and multi-turn agent-environment interaction, to unlock this potential.
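For reference, pass@k is typically computed with the unbiased estimator of Chen et al. (2021): given n samples per problem of which c are correct, the probability that at least one of k drawn samples is correct is 1 - C(n-c, k)/C(n, k), evaluated in a numerically stable product form:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples drawn from n generations (c correct) is correct."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# With 100 samples and 5 correct, pass@1 = 0.05 while pass@64 ~ 0.995,
# illustrating how a base model with broad but unreliable coverage can
# overtake an RLVR-trained model once k is large.
print(round(pass_at_k(100, 5, 1), 3), round(pass_at_k(100, 5, 64), 3))
```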
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
How LLMs Learn to Reason: A Complex Network Perspective
Hu, Sihan, Cai, Xiansheng, Huang, Yuan, Yao, Zhiyuan, Zhang, Linfeng, Zhang, Pan, Deng, Youjin, Chen, Kun
Training large language models with Reinforcement Learning with Verifiable Rewards (RLVR) exhibits a set of distinctive and puzzling behaviors that remain poorly understood, including a two-stage learning curve, a V-shaped response-length trajectory, and a pronounced vulnerability to catastrophic forgetting. In this work, we propose that these behaviors are emergent collective phenomena governed not by neural implementation details, but by the topological evolution of the latent reasoning graph in semantic space. By demonstrating a dynamical isomorphism between a 1.5B-parameter LLM and a minimal Concept Network Model (CoNet), we trace the causal source to the self-organization of a sparse concept web pinned to an average degree of two. This geometric perspective provides a unified physical explanation for the observed anomalies: the V-shaped trajectory tracks the evolution from parallel local skill optimization to global network integration; catastrophic forgetting stems from the topological disconnection of critical "trunk" edges; and policy collapse arises from the accumulation of sequential transitions at the web's leaf nodes, where broad exploration abruptly freezes into rigid, high-reward trajectories. Identifying a "maximally frustrated state" at the transition between learning stages, we propose Annealed-RLVR, a principled algorithm that injects a targeted SFT "heating" step to resolve this topological bottleneck. Experiments confirm that this theory-driven intervention outperforms standard RLVR on both in-distribution and out-of-distribution benchmarks (including Minerva and AIME). By recasting RLVR from black-box optimization into a predictable process of structural self-organization, our work provides a new physical intuition for engineering the emergent reasoning capabilities of future AI systems.
- Asia > China > Gansu Province > Lanzhou (0.04)
- Asia > China > Anhui Province > Hefei (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
Pass@k Metric for RLVR: A Diagnostic Tool of Exploration, But Not an Objective
The ability of Large Language Models (LLMs) to perform complex, multi-step reasoning is a central focus of modern AI research. To evaluate and enhance this capability, the pass@k metric, which measures the probability of obtaining at least one correct solution in k independent samples, has received significant attention. Its intuitive appeal has led to its adoption not only as an evaluation standard but also as a direct optimization objective in reinforcement learning. In this paper, we analyze the pass@k objective, derive its gradient, and demonstrate that it is fundamentally a per-example positive reweighting of the simpler pass@1 objective. Our analysis reveals that the pass@k objective provides a vanishing learning signal in regimes where exploration is most critical. We further analyze the dynamics of "exploration collapse", showing that as the policy concentrates probability mass, the gap between pass@k and pass@1 diminishes. We conclude that while pass@k is a useful diagnostic tool, it may be an unsuitable direct objective for optimization. Instead, mechanisms explicitly encouraging efficient exploration could offer a more effective path forward for reinforcement learning in reasoning tasks.
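The reweighting claim admits a one-line derivation under the usual assumption of independent samples (notation illustrative): writing p_theta(x) for the per-example pass@1 probability,

```latex
% pass@k as a function of the per-example solve rate p_\theta(x):
\[
\mathrm{pass@}k(x) \;=\; 1 - \bigl(1 - p_\theta(x)\bigr)^{k}
\]
% Differentiating yields a per-example positive reweighting of the pass@1 gradient:
\[
\nabla_\theta\, \mathrm{pass@}k(x)
  \;=\; k \bigl(1 - p_\theta(x)\bigr)^{k-1} \,\nabla_\theta\, p_\theta(x)
\]
% The weight k(1-p)^{k-1} is strictly positive for p < 1, so optimizing pass@k
% only rescales the pass@1 signal. On hard examples the policy essentially
% never solves (p ~ 0), sampled batches contain no correct responses, so the
% Monte Carlo estimate of \nabla_\theta p_\theta(x) -- and hence the pass@k
% signal -- vanishes exactly where exploration is most needed.
```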
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.57)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
GRPO Privacy Is at Risk: A Membership Inference Attack Against Reinforcement Learning With Verifiable Rewards
Liu, Yule, Zhang, Heyi, Zheng, Jinyi, Sun, Zhen, Peng, Zifan, Cong, Tianshuo, Yang, Yilong, He, Xinlei, Ma, Zhuo
Membership inference attacks (MIAs) on large language models (LLMs) pose significant privacy risks across various stages of model training. Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have brought a profound paradigm shift in LLM training, particularly for complex reasoning tasks. However, the on-policy nature of RLVR introduces a unique privacy leakage pattern: since training relies on self-generated responses without fixed ground-truth outputs, membership inference must now determine whether a given prompt (independent of any specific response) is used during fine-tuning. This creates a threat model in which leakage arises not from answer memorization but from how training reshapes model behavior. To audit this novel privacy risk, we propose Divergence-in-Behavior Attack (DIBA), the first membership inference framework specifically designed for RLVR. DIBA shifts the focus from memorization to behavioral change, leveraging measurable shifts in model behavior across two axes: advantage-side improvement (e.g., correctness gain) and logit-side divergence (e.g., policy drift). Through comprehensive evaluations, we demonstrate that DIBA significantly outperforms existing baselines, achieving around 0.8 AUC and an order-of-magnitude higher TPR@0.1%FPR. We validate DIBA's superiority across multiple settings--including in-distribution, cross-dataset, cross-algorithm, black-box scenarios, and extensions to vision-language models. Furthermore, our attack remains robust under moderate defensive measures. To the best of our knowledge, this is the first work to systematically analyze privacy vulnerabilities in RLVR, revealing that even in the absence of explicit supervision, training data exposure can be reliably inferred through behavioral traces.
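As an illustrative sketch of the two feature axes (not the authors' implementation; all names hypothetical), membership features for a candidate prompt might combine a correctness gain with a log-probability drift proxy:

```python
def diba_style_features(base_acc: float, tuned_acc: float,
                        base_logprobs: list[float],
                        tuned_logprobs: list[float]) -> tuple[float, float]:
    """Two membership signals for a candidate prompt (illustrative only):
    - advantage-side: correctness gain of the tuned model over the base
      model, estimated by repeated sampling on the prompt;
    - logit-side: mean per-token log-prob shift on a shared response, a
      crude proxy for policy drift between the two models."""
    correctness_gain = tuned_acc - base_acc
    drift = sum(t - b for t, b in zip(tuned_logprobs, base_logprobs)) / len(base_logprobs)
    return correctness_gain, drift

# A prompt seen during RLVR training tends to show a larger correctness gain
# and larger drift than an unseen one; thresholding these features (or
# feeding them to a small classifier) yields the membership decision.
gain, drift = diba_style_features(0.2, 0.9, [-2.1, -1.8, -2.4], [-0.9, -1.1, -1.5])
print(gain, drift)
```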
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)